An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models
Knowledge distillation (KD) is a well-known method for compressing neural
models. However, works focusing on distilling knowledge from large multilingual
neural machine translation (MNMT) models into smaller ones are practically
nonexistent, despite the popularity and superiority of MNMT. This paper bridges
this gap by presenting an empirical investigation of knowledge distillation for
compressing MNMT models. We take Indic to English translation as a case study
and demonstrate that commonly used language-agnostic and language-aware KD
approaches yield models that are 4-5x smaller but suffer performance drops of
up to 3.5 BLEU. To mitigate this, we then experiment with design
considerations such as shallower versus deeper models, heavy parameter sharing,
multi-stage training, and adapters. We observe that deeper compact models tend
to be as good as shallower non-compact ones, and that fine-tuning a distilled
model on a high-quality subset slightly boosts translation quality. Overall, we
conclude that compressing MNMT models via KD is challenging, indicating immense
scope for further research.
Comment: accepted at EAMT 202
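As a concrete illustration of the KD setup the abstract describes, below is a minimal sketch of a word-level distillation loss for NMT in PyTorch. All names and defaults (student_logits, teacher_logits, pad_id, alpha, temperature) are illustrative assumptions, not the paper's implementation, which also covers language-aware variants, parameter sharing, and adapters.

```python
# Minimal word-level KD sketch for NMT (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      pad_id=0, alpha=0.5, temperature=2.0):
    """Blend gold cross-entropy with KL to the teacher's soft targets.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len) gold target token ids
    """
    vocab = student_logits.size(-1)
    # Hard-label loss against the gold references.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         labels.view(-1), ignore_index=pad_id)
    # Soft-label loss: match the teacher's temperature-smoothed distribution.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_logp, t_probs, reduction="none").sum(dim=-1)
    mask = labels.ne(pad_id).float()  # ignore padding positions
    kl = (kl * mask).sum() / mask.sum() * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```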
MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
There have been several meta-evaluation studies on the correlation between
human ratings and offline machine translation (MT) evaluation metrics such as
BLEU, chrF2, BERTScore, and COMET. These metrics have been used to evaluate
simultaneous speech translation (SST), but their correlations with human
ratings of SST, which have recently been collected as Continuous Ratings (CR),
remain unclear. In this paper, we leverage the evaluations of candidate systems
submitted to the English-German SST task at IWSLT 2022 and conduct an extensive
correlation analysis of CR and the aforementioned metrics. Our study reveals
that the offline metrics are well correlated with CR and can be reliably used
for evaluating machine translation in simultaneous mode, with some limitations
on the test set size. We conclude that given the current quality levels of SST,
these metrics can be used as proxies for CR, alleviating the need for
large-scale human evaluation. Additionally, we observe that the metrics'
correlations with CR are significantly higher when translation, rather than
simultaneous interpreting, is used as the reference, and thus we recommend the
former for reliable evaluation.
Comment: IWSLT 202
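For readers who want to reproduce this kind of meta-evaluation, here is a minimal sketch of a system-level correlation analysis between metric scores and CR, using SciPy. All values and system names below are hypothetical placeholders, not the IWSLT 2022 results.

```python
# System-level correlation of MT metrics with Continuous Ratings (sketch).
from scipy.stats import pearsonr, kendalltau

cr = [3.9, 3.4, 4.2, 3.7]  # hypothetical mean CR per candidate system
metric_scores = {
    "BLEU":  [28.1, 25.4, 30.2, 27.0],  # hypothetical per-system scores
    "COMET": [0.79, 0.74, 0.82, 0.77],  # hypothetical per-system scores
}

for name, scores in metric_scores.items():
    r, p_r = pearsonr(scores, cr)
    tau, p_tau = kendalltau(scores, cr)
    print(f"{name}: Pearson r = {r:.3f} (p = {p_r:.3f}), "
          f"Kendall tau = {tau:.3f} (p = {p_tau:.3f})")
```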